Abstract

We describe a prototype system for multilingual gist- 
ing of Web pages, and present an evaluation method- 
ology based on the notion of gisting as decision sup- 
port. This evaluation paradigm is straightforward, 
rigorous, permits fair comparison of alternative ap- 
proaches, and should easily generalize to evalua- 
tion in other situations where the user is faced with 
decision-making on the basis of information in re- 
stricted or alternative form. 



Introduction: Gisting as Decision Support

The word "gisting" has been used in a variety of set- 
tings. Informally, it simply means "getting the gist," 
that is, given some information conveyed by natural 
language, understanding some characteristic or im- 
portant aspect of that information. 

By definition, gisting is an activity in which the 
information taken into account is less than the full 
information content available. In this paper, we take 
the view that there is another key aspect of gisting 
that goes beyond simply selecting a subset of available 
information, namely the goal of supporting decision- 
making. In an environment where human beings are 
attempting to gist radio traffic, for example, radio 
operators need to decide whether or not to route information
to electronic warfare analysts (Elsaesser 1996).
Accordingly, in order to evaluate a particular method
for gisting, one must examine the extent to which
gisting supports a decision-making task.

The focus of this paper is multilingual gisting on 
the World Wide Web, with particular attention to 
developing a methodology for evaluating multilingual 
gisting based on its role of decision support. We 
see such an evaluation methodology as important be- 
cause, although the real proof of any method is in how 
well it supports real users at their real-world tasks,
studying users in fully natural settings can be difficult
to organize, and, more important, two natural
settings are rarely similar enough to afford a fair comparison
between alternative approaches to the same
task. In order to address that problem, the method- 
ology we propose is applicable to a wide variety of 
tasks, simple to carry out and, most important, de- 
fined in enough detail that competing methods can 
be evaluated against the same set of data and the 
results compared. 

Gisting for the Web: A Simple Prototype

The motivation for this line of research can be de- 
scribed quite simply. Imagine that you are brows- 
ing the World Wide Web using your favorite Web 
browser. You click a link, or conduct a search, and 
find yourself looking at the page illustrated in Figure 1.
As it happens, you don't know a word of
Japanese. What are your options? Is it worth find- 
ing a bilingual dictionary and looking up words on 



this page? (And if so, which words?) Is it worth 
following links on this page in hopes of finding some- 
thing understandable? (And if so, which links?) Is 
it worth bothering a nearby colleague who knows the 
language and asking for a rough translation? Is it 
worth going to the time and expense of using an online
service (e.g. The Global Translation Alliance,
http://www.aleph.com/) to translate the page completely?

In considering possible solutions to this scenario, 
we arrived at the following principles. 

Avoid full-scale machine translation. The
user's problem would certainly be solved by a
fully automatic translation of the Web page under
consideration. Unfortunately, the state of the
art in high quality machine translation is typically
measured in words per minute rather than
pages per minute (Dorr 1996), so even if it is
possible to obtain a translation for the page, the
user is still faced with the decision of whether or
not it is worth sacrificing the time to obtain it.

Keep the human in the loop. We see the
problem scenario as an opportunity for collaboration
between person and machine, and in particular
an opportunity for the machine to help
the user do the things that people do well.
For example, people are capable of disambiguating
words almost effortlessly in context, although
this is a task at which computers currently perform
quite poorly; therefore it makes sense to
have the computer present alternatives rather
than making disambiguation decisions for itself,
unless such decisions can be made with very high
confidence.

Aim for extensibility. Our emphasis is on 
modular and distributed design; for example, al- 
though we do not attempt to disambiguate words 
in order to automatically select meaning equiv- 
alents in the user's language, a disambiguation 
component could easily be added to the system 
without wholesale changes in its design. An ultimate
target of our efforts is the dissemination of
application programmer interfaces (APIs) that
will make extensible infrastructure available to
the community at large.

With those principles in mind, we implemented a 
prototype gisting proxy, which assists users when confronted
with a Web page in an unknown language.^1



^1 For the moment we are glossing over who invokes the
gisting proxy, and how. In its full generality, this proxy 
is part of a general design for a multilingual agent that is 
aware of the user's linguistic knowledge and preferences, 
and goes into action when it detects a situation where its 
capabilities might assist the user. For the current proto- 
type, we have implemented a gisting proxy HTTP server 
initiated by the user. 



When invoked for a given Web page, the gisting proxy 
behaves as follows: 

1. Convert the character encoding of the document 
into a standard encoding. 

2. Divide the Web page into structurally distinct 
pieces, using HTML markup. 

3. For each piece: 

(a) Automatically identify the natural language in 
which this piece of text is written 

(b) Invoke language-dependent word identification 
and normalization 

(c) Look up each word in an on-line bilingual dictio- 
nary 

(d) Present word-by-word glosses in the context of 
the original page 

4. Modify all links on the page so that further nav- 
igation from this point on will automatically go 
through the gisting proxy. 
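
Step 4 can be illustrated with a minimal sketch. The proxy address `PROXY_PREFIX` and the function name `rewrite_links` are hypothetical (the paper does not specify how the prototype rewrites links), and the regular expression below handles only double-quoted `href` attributes, which a production proxy would need to generalize.

```python
import re
from urllib.parse import quote, urljoin

# Hypothetical address of the gisting proxy; not taken from the paper.
PROXY_PREFIX = "http://localhost:8080/gist?url="

# Matches double-quoted href attributes only (a simplifying assumption).
HREF_RE = re.compile(r'(href\s*=\s*")([^"]*)(")', re.IGNORECASE)


def rewrite_links(html: str, base_url: str) -> str:
    """Point every double-quoted href at the proxy.

    Relative URLs are resolved against base_url first, so the proxy
    always receives an absolute target.
    """
    def repl(match: re.Match) -> str:
        absolute = urljoin(base_url, match.group(2))
        proxied = PROXY_PREFIX + quote(absolute, safe="")
        return match.group(1) + proxied + match.group(3)

    return HREF_RE.sub(repl, html)
```

Percent-encoding the whole target URL keeps its own query string from being confused with the proxy's parameters.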

Step 1 is necessary because different character encodings
can be used for the same language, particularly
in the case of Asian languages (e.g. EUC-JP
vs. Shift-JIS). Normalization of character encoding
is necessary for consistency across components of the
system.
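
The need for Step 1 can be seen in a short sketch. We assume here that the "standard encoding" is UTF-8, which the paper does not state, and the function name `to_standard_encoding` is ours. The same Japanese string has different byte sequences in EUC-JP and Shift-JIS, but both normalize to identical output.

```python
def to_standard_encoding(raw: bytes, source_encoding: str) -> bytes:
    """Decode bytes from source_encoding and re-encode them as UTF-8.

    UTF-8 as the target is an assumption made for this sketch.
    Undecodable bytes are replaced rather than raising an error.
    """
    return raw.decode(source_encoding, errors="replace").encode("utf-8")


sample = "日本語"                          # "Japanese language"
as_euc_jp = sample.encode("euc_jp")        # one byte sequence
as_shift_jis = sample.encode("shift_jis")  # a different byte sequence

normalized_a = to_standard_encoding(as_euc_jp, "euc_jp")
normalized_b = to_standard_encoding(as_shift_jis, "shift_jis")
```

In practice the source encoding must itself be detected or read from the HTTP headers or page markup; that step is omitted here.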

Step 2 makes it possible to analyze documents
containing text in multiple languages. Small subdocument
units (e.g. list items) motivate taking
an approach to automated language identification
(Step 3a) that can work well even when the strings
to be identified are very short and cannot be relied
upon to contain function words (Dunning 1994).
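
To make the idea of identifying very short strings concrete, here is a much-simplified sketch in the spirit of character n-gram language identification. It is not the prototype's implementation: the class name, the bigram order, the add-one smoothing, and the toy training sentences are all assumptions made for illustration.

```python
import math
from collections import Counter


def ngrams(text, n=2):
    """Character n-grams of text, padded with spaces at both ends."""
    padded = " " + text.lower() + " "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class NgramLanguageID:
    """Toy identifier: one smoothed character n-gram model per language."""

    def __init__(self, n=2):
        self.n = n
        self.counts = {}    # language -> Counter of n-grams
        self.totals = {}    # language -> total n-gram count
        self.vocab = set()  # all n-grams seen in any language

    def train(self, language, text):
        counts = self.counts.setdefault(language, Counter())
        grams = ngrams(text, self.n)
        counts.update(grams)
        self.totals[language] = sum(counts.values())
        self.vocab.update(grams)

    def score(self, language, text):
        """Log-likelihood of text under the language, add-one smoothed."""
        counts = self.counts[language]
        denom = self.totals[language] + len(self.vocab) + 1
        return sum(math.log((counts[g] + 1) / denom)
                   for g in ngrams(text, self.n))

    def identify(self, text):
        return max(self.counts, key=lambda lang: self.score(lang, text))


identifier = NgramLanguageID()
identifier.train("en", "the quick brown fox jumps over the lazy dog and the cat")
identifier.train("es", "el rápido zorro marrón salta sobre el perro perezoso y el gato")
```

Because the score is built from character sequences rather than whole words, a two-word string with no function words can still be classified.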



Depending on the language, different measures
must be taken in order to identify words (Step 3b).
For example, in many Asian languages words are typically
not delimited by spaces, and therefore automatic
word segmentation is necessary (Matsumoto
1995). This contrasts with Romance languages such
as Spanish, where words are generally delimited by
spaces or punctuation but a small subset of the lexical
items in the language must be identified and separated
out (e.g. Spanish dámelo → da + me + lo).
In addition, some form of normalization may need
to be done as well. For example, in order to locate
da in a Spanish-English translation lexicon it may be
necessary to look it up by its root form, dar (to give).
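
The Spanish example can be sketched as follows. The clitic list and the form-to-lemma table are tiny hypothetical stand-ins for real morphological resources, and the function names are ours; the prototype's actual word identification component is not described at this level of detail.

```python
import unicodedata

# Toy resources, assumed for illustration only.
CLITICS = ("me", "te", "se", "lo", "la", "le", "nos", "los", "las", "les")
VERB_FORMS = {"da": "dar", "di": "decir"}  # inflected form -> root form


def strip_accents(word):
    """Remove combining marks so that 'dámelo' matches 'damelo'."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(c for c in decomposed if unicodedata.category(c) != "Mn")


def _split_tail(rest):
    """Split rest into a sequence of clitics, or return None if impossible."""
    if rest == "":
        return []
    for clitic in CLITICS:
        if rest.startswith(clitic):
            tail = _split_tail(rest[len(clitic):])
            if tail is not None:
                return [clitic] + tail
    return None


def split_clitics(word):
    """Return [verb, clitic, ...] for a known verb form plus clitics.

    Words that do not fit that pattern are returned unchanged as [word].
    """
    plain = strip_accents(word.lower())
    for form in sorted(VERB_FORMS, key=len, reverse=True):
        if plain.startswith(form) and len(plain) > len(form):
            tail = _split_tail(plain[len(form):])
            if tail is not None:
                return [form] + tail
    return [word]


def lemma(form):
    """Root form used for dictionary lookup; unknown forms map to themselves."""
    return VERB_FORMS.get(form, form)
```

Here `split_clitics("dámelo")` separates the verb from its pronouns, and `lemma("da")` then supplies the root form under which the lexicon entry would be found.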

Word-by-word lookup and presentation in this
system (Steps 3c and 3d) resemble the direct lexical
approach to machine translation investigated and
thoroughly criticized in the 1960s (ALPAC 1966).
Notably, however, the problem attacked by those
early efforts was one of full scale translation, not gisting.
We would contend that with the rise of the World
Wide Web, those early solutions have finally found
the right problem.



In the current prototype, presentation of the
known-language glosses for a word is guided by the
results of the dictionary lookup. At present:

• If the unknown language word has a single gloss in 
the dictionary, show that gloss. 

• If the unknown language word has multiple glosses 
in the dictionary, show up to n of them for some 
customizable parameter n (currently n = 3 by de- 
fault), within parentheses and separated by com- 
mas. For example, (doctor's office, clinic, dispen- 
sary). 

• If the unknown-language word is not found in the 
dictionary, then 

— Show the unknown-language word itself, if the 
character set of the language is the same as a 
language the user knows (e.g. an unknown word 
in French would be shown to someone who knows 
English, since both use the Latin-1 character 
set). 

— Show an ellipsis (...) otherwise. 

This treatment of words not appearing in the dic- 
tionary follows the general principle that users should 
be given information that might be helpful — such as 
possible cognates — but minimally distracted by un- 
familiar scripts. The present implementation reflects 
two extremes for unknown words, namely present- 
ing them as-is or leaving them out entirely, but other 
strategies are possible. 
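
The presentation rules listed above can be summarized in a short sketch. The function name, the dictionary representation (a mapping from a word to a list of glosses), and the toy entries are assumptions made for illustration; only the rules themselves come from the description above.

```python
def format_gloss(word, dictionary, same_script, max_glosses=3):
    """Render one unknown-language word according to the rules above.

    dictionary maps a word to a list of glosses (toy structure).
    same_script says whether the user can read the word's character set.
    max_glosses corresponds to the customizable parameter n.
    """
    glosses = dictionary.get(word, [])
    if len(glosses) == 1:
        return glosses[0]                       # single gloss: show it
    if len(glosses) > 1:
        shown = glosses[:max_glosses]           # up to n alternatives
        return "(" + ", ".join(shown) + ")"
    # Not in the dictionary: show the word itself only if it is readable.
    return word if same_script else "..."


# Toy entries, assumed for illustration.
toy_dictionary = {
    "医院": ["doctor's office", "clinic", "dispensary", "infirmary"],
    "日本": ["Japan"],
}
```

With these entries, a word with four glosses is cut to the first three, a word with one gloss is shown bare, and an unlisted Japanese word becomes an ellipsis for a reader who knows only Latin-script languages.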

Figure 2 shows the result of following this process
for the page in Figure 1. For comparison, Figure 3
shows the same entries as they appear in an English
version of the same business directory.

Our current implementation of the prototype han- 
dles gisting from Japanese, French, and Spanish to 
English, though in this paper we concern ourselves 
only with Japanese-English gisting. Given the sim- 
plicity of the approach, the main limiting factor in 
adding more languages to the list is the availability 
of bilingual dictionaries, though we expect that this 
problem may be ameliorated to some extent by automatic
algorithms for acquisition of bilingual lexicons
(Melamed 1996).



Evaluation Design Criteria 

The gisted text that appears in Figure 2 bears little
resemblance to an English translation of the Japanese
content in Figure 1. However, it does provide enough
information to support two critical decisions facing 
the user who has arrived at that page: 

• Deciding whether a link is worth following 

• Deciding whether some text is worth having trans- 
lated 

A user interested in, say, podiatrists, can discern from
the gisted text in Figure 2 that the first entry in the
Health category is probably not worth navigating further.
Similarly, someone interested in medical equipment
manufacturers might well decide that the third
entry is worth translating, especially if they have a
particular interest in companies in Osaka.

The central issue of this paper is how to evaluate 
the extent to which a gisting method helps the user to 
make decisions of this kind. In designing a method- 
ology for answering that question, we were guided by 
the following criteria: 



Approximate real Web-based decision 
tasks. Since we have characterized the role of 
gisting in terms of decision support, what must 
be evaluated is the extent to which gisted mate- 
rial facilitates decisions that resemble the choices 
available to the user when faced with multilin- 
gual content on a Web page. This considera- 
tion led us to select a categorization paradigm, 
since both real-world tasks involve a tradeoff
between the time invested in assessing relevance 
and the accuracy of the decision as well as the 
need to select an appropriate action based on 
that assessment. 

Minimize a priori biases. Users seeking in- 
formation on the Web are seldom given a pithy 
description of a topic by someone else. There- 
fore it is important, in designing the experimen- 
tal task, to allow users to form their own inter- 
nal characterization of a topic or category, rather 
than pre-assigning category labels that incorpo- 
rate the experimenters' perceptions or biases. 

Make the task easy to create. It is hoped 
that the methodology proposed here can serve 
as an outline for other experimenters investigat- 
ing multilingual gisting, spoken language gisting, 
translation, summarization, and related topics. 
Therefore we aim for an experimental design that 
requires little in the way of specialized appara- 
tus, preparation, and the like. 



The experimental design, adopting these criteria, 
is relatively straightforward. We define a task in 
which all subjects are faced with the same catego- 
rization problem, but some of those subjects are given 
materials in English to categorize while other subjects 
are given the same content to categorize but in the 
form of gisted text. If the subjects given gisted ma- 
terials make similar decisions to the subjects given 
the English materials (allowing for normal variabil- 
ity in people's judgments), we can conclude that the 
quality of the gisting is reasonable. The next section 
gives the details of the experiment, including a way 
to assess the results quantitatively. 
[jp] Health

(Uenishi, Kaminishi) dentistry (doctor's office (surgery), clinic, dispensary), (Wakayama-ken, prefecture in the Kinki area), Japan

ordinal (2, two) ... permanent tooth (in, inn) plant medical treatment (false tooth, denture) ... (trouble, worry, distress) ...

(cancellation, liquidation)

office ..., (Shiga-ken, prefecture in the Kinki area), Japan

(health, sound, wholesome) ... (manure, nightsoil, dung) (tea, Cha) (once, onetime, on one occasion) (with te-form verb)

please do for me ...

... (public company, corporation), (Oosaka-fu, Osaka (Oosaka) prefecture (metropolitan area)), Japan

(health, sound, wholesome) (medical care, medical treatment) machinery and tools ... (welfare, well-being) machinery and tools

(manufacture, production) sale



Figure 2: Gisted items from Nihongo Yellow Pages 
Uenishi Dental Office is now listed. Wakayama, Japan
Dental Implant disolves your dissatisfaction of the false teeth

Office Inoue is now listed. Shiga, Japan
Try our healthy diet tea "Ultra Slim Tea" from the USA!

Mitsui Engineering & Shipbuilding Co.,Ltd. is now listed. Osaka, Japan
We are manufacturing and distributing medical equipments for healthy life.



Figure 3: Corresponding English items from Nihongo Yellow Pages 
Materials



Experimental items were selected from the Nihongo
Yellow Pages (NYP), a business directory site on the
World Wide Web (Nihongo Yellow Pages 1996). The
site was chosen because it contains information across 
a variety of topic areas, because each business direc- 
tory listing consists of a concise and informative de- 
scription, and because most listings are available in 
both Japanese and English. In our experiments we 
used listings from NYP's Education, Finance, What's 
New, Entertainment, and Health categories, selecting 
a total of 73 business listings at random from those 
areas. 

For each of these listings we created a 3 x 5-inch 
index card with a business advertisement in English 
and a corresponding card with a "gisted" version of 
the same content as expressed in Japanese. By way
of illustration, Figure 3 shows three items in English,
with their corresponding gisted items appearing in
Figure 2.

Procedure 

Creating Topical Categories

In order to create topical categories in an objective
way, we randomly selected 32 of the 73 English cards
and gave them to 3 different subjects^2 with
instructions to sort the cards "into 4-6 piles of
roughly equal size, placing cards in the same pile
when you think they should 'go together', for example
because they are related to similar topics." One
subject created 4 piles, another 6, and the third 7
piles. We chose the 6 piles created by the second
subject as defining the topical categories for the
remainder of the study, noting that the topic
distinctions made by the three subjects were
qualitatively similar overall.^3

Categorization Task: The Control Condition 

A set of 6 subjects participated in the control condi- 
tion of the experiment. The procedure had two parts 
(see Figure 4).

1. First, subjects were presented with the 6 piles of 
English cards created as described above. They 
were asked to read through each pile and decide 
"what you think each one is about." As a memory 
aid, subjects were encouraged to write a description 
of their choosing on a Post-It note for each pile, and 
place the note next to the corresponding pile. 



^2 All subjects in this experiment were employees of Sun
Microsystems in Chelmsford, Massachusetts, solicited as 
volunteers. All were fluent in English and nobody who 
saw Japanese materials was at all familiar with Japanese. 

^3 As an additional piece of information, we had each
subject write a short description of the topic for each pile, 
though those descriptions were not used in the study. 



2. Having formed their own impression of the 6 topical 
categories, subjects in the control condition were 
now given 32 new randomly-selected cards in En- 
glish. They were instructed that for each new card, 
they should decide in which of the 6 categories it 
"belongs" and place it next to the corresponding 
pile. They were also given the option of placing 
cards in a seventh "none of the above" category. 

Subjects were told to take as long as they liked on 
both parts of the categorization task, though Part 2 
was timed for possible future use of that information. 

Categorization Task: The Experimental Condition

A set of 8 subjects participated in the experimental
condition. Part 1 of the experimental condition
was identical to Part 1 of the control
condition: subjects looked at exactly the same 6
piles of English cards and formed their own mental
description of each topical category, writing down a
short description as a memory aid.

Part 2 was also identical, with one crucial excep- 
tion: instead of being given cards in English to place 
into categories, subjects were given the corresponding 
gisted Japanese cards. 

Categorization Task: Random Baseline

In order to obtain a lower bound for performance on this
task, we had the computer perform 8 runs placing the
gisted Japanese cards into the 7 categories at random. We
also computed lower bounds with the computer making
a forced choice, i.e. not allowing random selection
to pick the "none of the above" category; the results
differed negligibly.
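
A random baseline of this kind takes only a few lines to produce. The sketch below is ours, not the code used in the study; the function name, the fixed seed, and the uniform choice over categories are assumptions, while the card count (32), category counts (7, or 6 for forced choice), and number of runs (8) follow the procedure described above.

```python
import random


def random_baseline(num_cards=32, num_categories=7, num_runs=8, seed=0):
    """Assign each card to a uniformly random category, once per run.

    Returns a list of runs; each run is a list of category indices,
    one per card. Index num_categories - 1 can stand for the
    "none of the above" pile when num_categories is 7.
    """
    rng = random.Random(seed)  # fixed seed so the runs are reproducible
    return [
        [rng.randrange(num_categories) for _ in range(num_cards)]
        for _ in range(num_runs)
    ]


runs = random_baseline()                          # 7 categories
forced_runs = random_baseline(num_categories=6)   # forced choice
```

Each run can then be treated as a subject in the distance analysis.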

Analysis 

The categorization data gathered in the experiment
were analyzed following the method of Hripcsak et
al. (1995). In their study, they compared
the performance of physicians, laypersons, and
several computer programs on the task of classify- 
ing chest radiograph reports according to the pres- 
ence or absence of 6 medical conditions. Our adap- 
tation of their analysis is almost completely direct, 
with subjects in the control condition (English cards) 
corresponding to the physicians, subjects in the ex- 
perimental condition (gisted cards) corresponding to 
laypersons, and each run of our random baseline cor- 
responding to a subject in their baseline conditions 
(simple keyword-based classification). 

Figure 4: Categorization of new items

The basic idea in the analysis is to compute the
"distance" between subjects on the basis of their
categorization behavior, and to see whether the average
distance between an experimental subject and the
members of the control group is greater than the average
distance of control group members from each
other. We compute the distance $d_{ijk}$ between two
subjects j and k for experimental item i as the number
of topical categories where the subjects disagreed
for this item, i.e. 0 if they placed item i into the same
category and 2 if they did not.^4 The overall distance
$d_{jk}$ from subject j to subject k is then just their
average distance across all N items:


$$ d_{jk} = \frac{1}{N} \sum_{i=1}^{N} d_{ijk} \qquad (1) $$


The main figure of interest in this study is how 
much the categorization behavior of subjects in the 
experimental (gisted cards) condition differs from be- 
havior of subjects in the control (English cards) con- 
dition. The average distance from a gisted-card sub- 
ject to the English-card subjects is 



$$ \bar{d}_j = \frac{1}{J} \sum_{k=1}^{J} d_{jk} \qquad (2) $$



where J is the number of English-card subjects. The 
corresponding average distance for English-card sub- 
jects is computed similarly, though naturally the av- 
eraging excludes the distance of each subject from 
himself or herself: 



$$ \bar{d}_j = \frac{1}{J-1} \sum_{1 \le k \le J,\; k \ne j} d_{jk} \qquad (3) $$
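
The three averages defined in this section can be sketched in a few lines. This is an illustrative reading of the definitions, not the analysis code used in the study; the function names and the toy category assignments are ours, and the confidence-interval computation is omitted.

```python
def item_distance(category_a, category_b):
    """0 if two subjects put an item in the same category, 2 otherwise."""
    return 0 if category_a == category_b else 2


def subject_distance(subject_a, subject_b):
    """Average item distance between two subjects over all N items."""
    pairs = list(zip(subject_a, subject_b))
    return sum(item_distance(a, b) for a, b in pairs) / len(pairs)


def distance_to_controls(subject, controls):
    """Average distance from one gisted-card subject to all J controls."""
    return sum(subject_distance(subject, c) for c in controls) / len(controls)


def distance_within_controls(index, controls):
    """Average distance from control subject `index` to the other J - 1."""
    others = [c for k, c in enumerate(controls) if k != index]
    return sum(subject_distance(controls[index], c) for c in others) / len(others)


# Toy data: each list gives one subject's category choice for 4 items.
controls = [[0, 1, 2, 3], [0, 1, 2, 4], [0, 1, 5, 3]]
gisted_subject = [0, 1, 2, 2]
```

Any other inter-rater measure could be substituted for `subject_distance` without changing the two averaging functions.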



Hripcsak et al. also give a method for computing
confidence intervals for these figures. In addition
they point out that the analysis holds equally well
for other inter-rater distance measures such as Cohen's κ,
though they comment that in their study
Cohen's κ and the above distance measure produced
essentially the same results.

^4 This distance measure was used because Hripcsak et
al. included the more general case of allowing an item
to be placed into multiple categories, i.e. in their case
distance could range from 0 to 6.




Figure 5: Left to right: English condition, Gisted
condition, Random condition



Results

Figure 5 shows, for each subject, a point (and
95% confidence interval) representing that subject's
average distance from the judgments of the subjects in the
English-card (control) condition. (Recall that distances
range from 0 to 2.) As one should expect, the
categorization behavior of subjects given degraded
information (gisted cards) is far closer to the control
group than random choice, but generally appears
to differ from that of subjects in the control
group, who were given full information in the form of
English cards.

We plan to replicate the study with a greater number
of subjects, in order to better assess the significance
of the variability that appears within the control
group; in particular, we want to determine whether the
degree of variance in the control group, suggested by
comparatively greater distances for the 4th and 5th
subjects, will persist given a larger sample. In
addition, it has been suggested that an additional, in- 
formative control in this experiment would be a group 
that performed the experiment using cards entirely in 
Japanese (for both the topical "piles" and the cards 
to be categorized); the materials for this condition 
are easily created, but our ability to perform the ex- 
periment will depend upon the availability of subjects 
who are fluent in Japanese. 

Discussion 

Our central concern in this paper is not the method 
used for gisting — though of course that is also of 
interest — but rather the evaluation methodology we 
have designed. Were we to extend the gisting pro- 
totype, for example by improving dictionary cover- 
age, adding automatic disambiguation, or manipu- 
lating word order, the value added by those changes 
could be measured simply and effectively by adding 
a condition to the above experiment in which subjects
received cards with the putatively improved information.
Similarly, anyone else's method for conveying
the content of Japanese Web pages (e.g. Temple
(Vanni & Zajac, to appear)) can be evaluated in
terms of its value for gisting (i.e. decision support) 
simply by creating the corresponding materials from 
the same Japanese items we used to produce gisted 
cards in our experiment. If one method for produc- 
ing gists is better than another, then subjects given 
that information should behave closer to the "ideal" 
case (defined here by the behavior of subjects who 
receive information in English), as assessed quantitatively
by the distance measure. Additional measures
might also be brought into play, such as a compari- 
son of the time it takes to make decisions given vari- 
ant forms of information, or differences in the time- 
accuracy tradeoff that results when time limitations 
are imposed. 

The evaluation methodology we have proposed 
generalizes easily to any number of other tasks that 
have similar characteristics, namely domains in which 
restricted or alternate-form information is used in 
support of decision-making because of limits on
time, space, or user knowledge. Some examples: 

• In environments where text summarization is used 
to decide the disposition of full documents, e.g. 
routing of memoranda or scientific articles, this 
methodology could be used to evaluate the qual- 
ity of summaries. 

• In environments where key elements are extracted 
from a stream of speech input, e.g. automatic mon- 
itoring of radio traffic, this methodology could be 
used to evaluate the extraction technology. 

• In environments where decisions are made on the
basis of text-to-speech output, e.g. spoken language
interfaces, this methodology could be used
to evaluate the clarity of the speech synthesizer.

• In environments where alternative versions of text 
or images can be presented, e.g. the selection of 
Web-based advertising based on client bandwidth, 
this methodology could be used to assess the im- 
pact of the advertisement format on users' interest 
level. 

We will be happy to make our experimental ma- 
terials available to other researchers on request. 